The data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
Reference: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
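Input variables (based on physicochemical tests): fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol.
Output variable (based on sensory data): quality.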
We use ggpairs on a subsample:
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
## Using as id variables
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
We notice that the vast majority of wines were assigned a rating between 5 and 7. There are no wines with ratings of 0, 1, 2 or 10. It might be useful to combine the ratings and form 3 groups [3,4], [5,7] and [8,9]. We use “cut” to create a new variable “quality.joined”.
We are interested in identifying the chemical properties of the white wines that could have influenced the quality rating. We will try to detect relationships between the rating (variable “quality”) and the variables describing the chemical properties.
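The grouping step can be sketched as follows. This is a minimal sketch, not the original chunk: the rating counts are copied from the frequency table above, and the break points `c(0, 4, 7, 10)` are an assumption inferred from the level labels `(0,4]`, `(4,7]` and `(7,10]` that appear in later output.

```r
# Counts per rating, copied from the frequency table printed above.
rating_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                   "7" = 880, "8" = 175, "9" = 5)

# Expand the counts into one quality value per wine.
quality <- rep(as.integer(names(rating_counts)), times = rating_counts)

# cut() builds right-closed intervals by default: (0,4], (4,7], (7,10].
# The break points are an assumption inferred from the printed level labels.
quality.joined <- cut(quality, breaks = c(0, 4, 7, 10))

group_counts <- table(quality.joined)
print(group_counts)

# Share of wines in the middle group (ratings 5 to 7).
middle_share <- group_counts[["(4,7]"]] / length(quality)
print(round(middle_share, 3))
```

With these counts, 4,535 of the 4,898 wines fall into the middle group, which leaves only 183 low-rated and 180 high-rated wines to learn from.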
From the pairwise plots we can get an overview of the data:
There are a few obvious interdependencies between other variables (e.g. alcohol and density, residual.sugar and density). These might also help to eliminate outliers. Further, (high) quality is not influenced by a single variable but rather by an (optimal?) combination of chemical properties.
Most histograms show very symmetric (Gaussian) behaviour with a few potential outliers. Alcohol and residual sugar are a little more skewed. Chlorides is also very symmetric and peaked around 0.04 but shows quite a few values above 0.1.
For some variables, we probably want to delete some outliers. This will be investigated next.
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1654 1654 7.9 0.330 0.28 31.6
## 1664 1664 7.9 0.330 0.28 31.6
## 2782 2782 7.8 0.965 0.60 65.8
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1654 0.053 35 176 1.01030 3.15
## 1664 0.053 35 176 1.01030 3.15
## 2782 0.074 8 160 1.03898 3.39
## sulphates alcohol quality quality.joined
## 1654 0.38 8.8 6 (4,7]
## 1664 0.38 8.8 6 (4,7]
## 2782 0.69 11.7 6 (4,7]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
While the high density of 1.03898 is an outlier, it is plausible because this wine has an extremely high amount of residual sugar.
However, with increasing residual sugar the influence of other variables should become weaker. We therefore discard the wine with the highest residual sugar.
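Dropping that wine could look like the sketch below. It assumes a data frame with the column names shown in the output above and uses only the three rows printed there; the name `wwq_rows` is a stand-in for illustration, and the original chunk may have filtered differently.

```r
# The three high-sugar rows printed above (only the columns needed here).
wwq_rows <- data.frame(
  X              = c(1654, 1664, 2782),
  residual.sugar = c(31.6, 31.6, 65.8),
  density        = c(1.01030, 1.01030, 1.03898)
)

# which.max() returns the position of the first maximum, so exactly one
# row is dropped even if several wines shared the highest value.
wwq_trimmed <- wwq_rows[-which.max(wwq_rows$residual.sugar), ]
print(wwq_trimmed)
```

Removing a row by position keeps every other wine untouched, which matters here because rows 1654 and 1664 are identical and a value-based filter on duplicates could remove more than intended.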
High quality wines tend to have higher percentages of alcohol.
Residual sugar doesn’t show a clear influence on wine quality. As with many other variables, a certain level of a chemical can result in very different quality ratings.
Next, let’s look at acidity. We expect low ratings for high values, as too much acidity leads to a vinegary taste.
We see that for values up to 0.7, there is no clear influence on the wine quality. Ratings seem to decrease for values higher than 0.8. But there are only a few data points, so we cannot be sure about a true correlation. Nevertheless, it makes sense to break acidity into groups, especially because we can assume that hitting a certain high level (maybe not reached in the dataset) will eventually have a bad influence on the wine quality:
## [1] "Next, we investigate the second acidity variable:"
## [1] "We delete the data point with the highest fixed acidity because it is the only wine with an acidity in this range."
## [1] "One more variable about acidity: citric acid"
The investigation shows an overlap in quality for different levels of citric acidity. High quality seems to be associated with a smaller range of citric acidity.
Now, we turn our attention to chlorides and sulphates:
## wwq$quality.joined: (0,4]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01300 0.03750 0.04600 0.05056 0.05400 0.29000
## --------------------------------------------------------
## wwq$quality.joined: (4,7]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04588 0.05000 0.34600
## --------------------------------------------------------
## wwq$quality.joined: (7,10]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01400 0.03000 0.03550 0.03801 0.04400 0.12100
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 2523 2523 7.3 0.17 0.24 8.1
## 2526 2526 7.3 0.17 0.24 8.1
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 2523 0.121 32 162 0.99508 3.17
## 2526 0.121 32 162 0.99508 3.17
## sulphates alcohol quality quality.joined
## 2523 0.38 10.4 8 (7,10]
## 2526 0.38 10.4 8 (7,10]
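The grouped summaries above have the shape of base R's `by()` output. The sketch below shows one way such a summary and the lookup of the maximum-chloride rows might be produced. It uses only the five rows printed in this report, so its numbers differ from the full-data summaries; `wwq_rows` is an illustrative name and the break points of `cut()` are inferred from the level labels.

```r
# Five rows printed in this report (only the columns needed here).
wwq_rows <- data.frame(
  X         = c(1654, 1664, 2782, 2523, 2526),
  chlorides = c(0.053, 0.053, 0.074, 0.121, 0.121),
  quality   = c(6, 6, 6, 8, 8)
)
wwq_rows$quality.joined <- cut(wwq_rows$quality, breaks = c(0, 4, 7, 10))

# Summary of chlorides per quality group; droplevels() removes the
# empty (0,4] group, which has no rows in this small excerpt.
chloride_summary <- by(wwq_rows$chlorides,
                       droplevels(wwq_rows$quality.joined),
                       summary)
print(chloride_summary)

# Rows of the high-quality group that reach the group's maximum chlorides.
high <- wwq_rows[wwq_rows$quality.joined == "(7,10]", ]
high_max <- high[high$chlorides == max(high$chlorides), ]
print(high_max)
```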
No apparent findings on sulphates. Next, sulfur dioxide:
## Warning in loop_apply(n, do.ply): Removed 8 rows containing missing values
## (stat_smooth).
## Warning in loop_apply(n, do.ply): Removed 8 rows containing missing values
## (geom_point).
Wine quality doesn’t seem to vary in any simple way with one of the variables describing a specific chemical attribute. For most plots, we notice that low quality wines show a wider range of values (containing the smaller range of high quality wines). While it might be hard to determine what makes a high quality wine, this could help determine when a chemical property becomes so extreme that it results in a bad taste. (The best example would be acidity.) In some cases, we might be able to construct a weak linear relationship by transforming a variable, see sulfur above.
We used the physical (nearly linear) relationship between alcohol/sugar content and density to determine extreme values.
The strongest (and also most obvious) relationship is the one between residual sugar and density. Also, alcohol and density are strongly correlated, even though residual sugar has the stronger influence (as “adding” alcohol can only lower the density towards the density of alcohol itself).
Relationships between chemicals and wine quality are rather weak.
The first plot is similar to the one above (after eliminating the outlier).
The scatter plot of fixed and volatile acidity supports our hypothesis that too high values of acidity (for at least one of the variables) might be correlated with lower scores (red).
We create two new variables:
In general, there are no clear correlations between wine quality and its chemical properties. Our visualizations suggest that only extreme values (e.g. for acidity) may influence the rating in a negative way. The dataset is problematic as most wines are assigned a moderate rating and not much can be inferred about low or high ratings (high ratings, of course, are of particular interest).
No. Strong relationships can only be found among chemical attributes (e.g. density and residual sugar). No surprises here.
We created a simple tree model. However, relationships are so weak that the model only makes use of two variables (alcohol and volatile acidity) and only assigns ratings of either 5 or 6.
Most white wines obtain a rating between 5 and 7. Only very few ratings of 3 and 4 or 8 and 9 are assigned. There are no ratings less than 3 and no wine is rated 10. As most wines are of medium quality, it will be hard to determine what chemical properties are typical of high quality wines (if possible in the first place).
The regression tree model results in a very simple structure with only two splits: if the alcohol level is below 10.85% and the volatile acidity is higher than 0.2525 g/dm^3, the wine is assigned a rating of 5 (upper left area). In all other cases its rating is 6. We added color and let the size of the dots increase with quality. It seems that the wines in the upper left rectangle are (on average) of lower quality than the ones outside the rectangle. In fact, the tree method gives a mean of 5.361 for the wines in the upper left area and a mean of 6.131 elsewhere. Nevertheless, this is a very disappointing result. In general, no strong relationships between wine quality and its chemical properties could be found.
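The decision rule described above is small enough to write out by hand. The following sketch hard-codes the two thresholds reported for the tree rather than refitting the model; the function name `predict_tree_rating` is hypothetical, and the three example wines are rows printed earlier in this report.

```r
# Hand-written version of the two-split rule reported for the tree model.
# predict_tree_rating is a hypothetical helper, not part of the analysis.
predict_tree_rating <- function(alcohol, volatile.acidity) {
  ifelse(alcohol < 10.85 & volatile.acidity > 0.2525, 5, 6)
}

# Three wines taken from the rows printed earlier in this report.
wines <- data.frame(
  X                = c(1654, 2782, 2523),
  alcohol          = c(8.8, 11.7, 10.4),
  volatile.acidity = c(0.330, 0.965, 0.17),
  quality          = c(6, 6, 8)
)

wines$predicted <- predict_tree_rating(wines$alcohol, wines$volatile.acidity)
print(wines)
```

Wine 2523 was rated 8 but receives a predicted rating of 6, which illustrates how little a model with only two possible outputs can say about high quality wines.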
We combine several variables to get a more complete view of the data: On the x-axis we multiply the level of sulfur dioxide (an antioxidant) with the amount of citric acid (normalized per dm^3) as a measure of freshness. On the y-axis we multiply fixed and volatile acidity levels, which in high doses can lead to a vinegary taste. Further, we use the result from the tree regression model, which suggests discriminating wine quality based on the alcohol content. The color distinction shows that high quality wines (the right facet) are more often found at alcohol levels over 10.85%. The opposite holds for wines of quality 3 or 4. Also, high quality is more often found with high “freshness” (keep in mind the log scale on the x-axis). Most wines show low “acidity” levels. However, higher levels are more often found for medium or low quality wines.
The dataset contains almost 5000 white wines that were rated by at least three experts. Eleven chemical attributes like sulfur content, pH level etc. are listed.
Only weak relationships between the quality and the chemical attributes could be found. This is hardly surprising because we can hardly expect to model human taste (a very complex and poorly understood sense) with only eleven variables. Some variables are strongly correlated (e.g. density and alcohol content or amount of residual sugar). A tree regression model was applied but could only provide little insight. One problem might be that most wines are of medium quality (with ratings between 5 and 7). The dataset contains only a few wines with ratings of 8 and 9. This makes it hard to make inferences about high quality wines. Also, combinations of variables could only slightly improve the situation. What we can say is the rather trivial fact that certain chemicals in very high doses (e.g. volatile acidity) are likely to have a negative influence on the taste. If a wine shows only moderate chemical attributes, nothing can be said about its potential rating. As a guideline, our visualizations showed the following:
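The two combined variables could be computed along these lines. This is a sketch under assumptions: the text does not say whether free or total sulfur dioxide enters the product, so `free.sulfur.dioxide` is used here purely as a guess, and the column names `freshness`, `acidity` and `alcohol.high` are illustrative. The three wines are rows printed earlier in this report.

```r
# Three wines taken from the rows printed earlier in this report.
wines <- data.frame(
  X                   = c(1654, 2782, 2523),
  fixed.acidity       = c(7.9, 7.8, 7.3),
  volatile.acidity    = c(0.330, 0.965, 0.17),
  citric.acid         = c(0.28, 0.60, 0.24),
  free.sulfur.dioxide = c(35, 8, 32),
  alcohol             = c(8.8, 11.7, 10.4)
)

# "Freshness": sulfur dioxide times citric acid.
# Assumption: the free (not total) sulfur dioxide column is used.
wines$freshness <- wines$free.sulfur.dioxide * wines$citric.acid

# "Acidity": product of fixed and volatile acidity.
wines$acidity <- wines$fixed.acidity * wines$volatile.acidity

# Alcohol split taken from the tree model threshold of 10.85%.
wines$alcohol.high <- wines$alcohol > 10.85

print(wines[, c("X", "freshness", "acidity", "alcohol.high")])
```

Because both new variables are products, a log scale on the axes (as used for the x-axis in the plot) spreads out the many small values that would otherwise be crowded near the origin.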
might be more likely associated with high ratings, however, these are far from being sufficient criteria.
For a better understanding of wine quality, more chemical properties are needed.